Navigating the E. coli functional landscape

Author

Daniel Martinez-Martinez, Filipe Cabreiro

Introduction

The partition of the existing bacteria into biological units have been debated over time, leaving us with a working framework where bacterial species can be divided based based on their genetic content. However, bacterial species are non-stationary groups that are in constant evolution, dynamically gaining and losing genetic material while keeping a core of functions unchanged even across large geographical distances. Moreover, the functional content of a given pair of strains can largely differ due to their evolutionary history, meaning that they can have distinct ways to operate with their environment. Besides, the functional characterization of these strains remains an open challenge due to their, sometimes, extremely large diversity and lack of experimental validation. This project is meant to explore the functional landscape of the Escharichia coli pangenome, one of the most sequenced bacterial species at this moment.

Results

Filtering strains to be analysed

Contigs and Scaffold status

Code
library(tidyverse)
library(readr)
library(readxl)
library(cowplot)
library(viridis)
library(patchwork)

theme_set(theme_cowplot(14))

Strain information has been downloaded from NCBI genomes. The metadata was downloaded on January 11th 2024. To reduce the number of E. coli genomes, the assembly level allowed was only for the types of “chromosome”, “complete genome” or “scaffold”. We can have a look at parameters such as the contig_n50/scaffold_n50 to see how the distribution of contigs looks like in the full cohort. We can see two different peaks, one near 0 that represents all the assemblies that have multiple contigs, and another around 5 megabases

Code
metadata = read_xlsx('ecoli_ncbi_metadata.xlsx')

metadata = metadata %>% 
  distinct(assembly_name, .keep_all = T)

metadata = metadata %>% 
  # remove the string after the dot in assembly_id
  mutate(assembly_id_simp = str_remove(assembly_id, "\\..*"),
         .before = "assembly_name")

metadata
Code
metadata %>% 
  ggplot(aes(contig_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Contig N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Code
# plot distribution of scaffold_n50
metadata %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

It is even more evident that there are huge differences in terms of the scaffold N50 parameter in respect to their assembly level. As expected, we have a greater count of scaffold at chromosome or complete genome level compared to scaffold. The assemblies from the first two categories will be included, and now we need to include a threshold for the last category to remove strains with low scaffold N50.

Code
# plot scaffold distribution by assembly level with violin plot
metadata %>% 
  ggplot(aes(assembly_level, scaffold_n50, fill = assembly_level)) +
  geom_violin(show.legend = F) +
  # plot mean and sd as pointrange
  stat_summary(fun.data = mean_sdl, geom = "pointrange", 
               fun.args = list(mult = 1), color = "black",
               show.legend = F) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Assembly level", y = "Scaffold N50") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis(discrete = T, begin = 0.2, end = 0.8) +
  theme_cowplot(14)

When we filter out the complete categories and then show the distribution of the scaffold N50, we have this distribution:

Code
# show scaffold distribution of assembly_level == scaffold
p1 = metadata %>% 
  filter(assembly_level == 'Scaffold') %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_cowplot(14)

p2 = metadata %>% 
  filter(assembly_level == 'Scaffold') %>%
  filter(scaffold_n50 < 500000) %>% 
  ggplot(aes(scaffold_n50)) +
  geom_histogram(binwidth = 10000) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Scaffold N50", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 


# show them with patchwork
p1 + p2

Left: Distribution of scaffold N50 in the full cohort. Right: Distribution of scaffold N50 in the filtered cohort with a threshold of 150K.
Code
metadata_filt = metadata %>% filter(scaffold_n50 > 100000)
dims = dim(metadata_filt)
dims[1]
[1] 11860

From the distribution of the scaffold N50 values, we can safely chose a threshold of ~ 150K as that will leave out E. coli assemblies that have a high amount of contigs/scaffolds.

In summary: we are choosing a threshold of 150K for the scaffold_n50 parameter from now on. Anything below that will be discarded.

After this filtering step we have 11860 genomes left.

CheckM filters

We need to filter the bad quality genomes from the cohort, which we can start doing by filtering out by checkM metricsL:

  • checkM completeness: the value that is usually applied ranges from 70%, but I’ll be using a more strict value of 95% because I want to have full genomes when possible.

  • checkM contamination: I am applying a value of 1% max contamination to avoid getting false information regarding the gene content.

Code
metadata_filt %>% 
  mutate(checkM_completeness_class = 
           case_when(is.na(checkM_completeness) ~ 'NA',
                     checkM_completeness < 95 ~ "< 95%",
                     checkM_completeness >= 95 ~ ">= 95%")) %>% 
  count(checkM_completeness_class) %>% print
# A tibble: 2 × 2
  checkM_completeness_class     n
  <chr>                     <int>
1 < 95%                       122
2 >= 95%                    11738
Code
metadata_filt %>% 
  mutate(checkM_contamination_class = 
           case_when(is.na(checkM_contamination) ~ 'NA',
                     checkM_contamination > 1 ~ "> 1%",
                     checkM_contamination <= 1 ~ "<= 1%")) %>% 
  count(checkM_contamination_class) %>% print
# A tibble: 2 × 2
  checkM_contamination_class     n
  <chr>                      <int>
1 <= 1%                       9421
2 > 1%                        2439
Code
metadata_filt = metadata_filt %>% 
  filter(checkM_completeness > 95,
         checkM_contamination < 1) %>% 
  filter(seq_length < 7000000)

Finally, after calculating the number of contigs per assembly, I have seen that some of them have a high number of contigs, which is not a good sign for the quality of the assembly. I have decided to filter out genomes with more than 300 contigs.

Code
metadata_filt = read_csv("tables/ecoli_ncbi_metadata_filt.csv")

p1 = metadata_filt %>% 
  # filter(Contig_number < 500) %>% 
  ggplot(aes(Contig_number)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Number of contigs", y = "Number of genomes") +
  theme_cowplot(14)

p2 = metadata_filt %>% 
  filter(Contig_number < 300) %>% 
  ggplot(aes(Contig_number)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(labels = scales::comma) +
  labs(x = "Number of contigs", y = "Number of genomes") +
  theme_cowplot(14)

p1 + p2

Left: Distribution of contig number in the filtered cohort. Right: Distribution of contig number in the filtered cohort with a threshold of 300 contigs.
Code
metadata_filt = metadata_filt %>% 
  filter(Contig_number < 300)

Another filter that I have applied is using the phylogroup characterization of the strains with the tool EzClermont1. As we can see in the next plot with the phylogroup distributions, we have some that failed or that belong to cryptic groups. I have discarded these.

Code
phylogroups = read_tsv("tables/phylogroups_ezclermont.tsv",
         col_names = F) %>% 
  rename(genome_id = `X1`,
         phylogroup = `X2`) %>% 
  mutate(assembly_id = str_sub(genome_id, 1, 15))

phylogroups %>% 
  ggplot(aes(phylogroup, fill = phylogroup)) +
  geom_bar(show.legend = F) +
  labs(x = "Phylogroup", y = "Number of genomes") +
  theme_cowplot(14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
library(tidyverse)
metadata_final = read_csv("tables/MAIN_ecoli_NCBI_metadata.csv") 

dim_filt = metadata_final %>% dim

Summarising all the filtering steps I have performed:

  1. filtered out genomes with a scaffold N50 lower than 150K.

  2. filtered out genomes with a checkM completeness lower than 95%.

  3. filtered out genomes with a checkM contamination higher than 1%.

  4. filtered out genomes with a sequence length greater than 7Mbp.

  5. filtered out genomes with a contig number higher than 300.

  6. filtered out genomes that had a higher mash difference than 0.05.

  7. filtered out genomes that were not classified as one of the main E. coli phylogroups

Lab strains

In addition to the strains from the NCBI database, I have included the strains we are using in our lab. These strains are not part of the NCBI database, so they are not included in the filtering steps. In total we are adding 729 strains that belong also to the normal phylogroups as before.

Final number of strains

In summary, if we apply all filters, we are left with 9566 genomes.

Filtering strains to be analysed

The distribution of the different phylogroups represented in the E. coli cohort follow a similar trend as in other research publications2. The phylogroup B1 is the most represented, followed by phylogroup A and B2. The rest of the phylogroups are less represented, with phylogroup U being the least represented.

Code
phylo_colors = c("A" = "#1b9e77",
                 "B1" = "#d95f02",
                 "B2" = "#7570b3",
                 "D" = "#e7298a",
                 "E" = "#66a61e",
                 "F" = "#e6ab02",
                 "G" = "#a6761d",
                 "C" = "#3A254D",
                 "U" = "#666666")

metadata_final %>% 
  ggplot(aes(phylogroup, fill = phylogroup)) +
  geom_bar(show.legend = F) +
  labs(x = "Phylogroup", y = "Number of genomes") +
  scale_fill_manual(values = phylo_colors) +
  theme_cowplot(14) 

Methods

E. coli strain selection

The E. coli strains were selected from the NCBI genome database. The metadata was downloaded on January 11th 2024. The strains were filtered based on the following criteria: genomes that were not at the assembly level of “chromosome”, “complete genome” or “scaffold” were removed. The scaffold N50 was used to filter out genomes with a value lower than 150K. The checkM metrics were used to filter out genomes with a completeness lower than 95% and a contamination higher than 1%. Mash distances were calculated pairwise between all genomes, and those distances that were higher than 0.05 were removed. Finally, genomes with a sequence length greater than 7Mbp, and/or genomes with a contig count higher than 300 were removed. Genomes were dowloaded with the NCBI dataset download tool (v. 16.2.0). The final number of genomes was 9566.

Gene annotation and pangenome analysis

The gene annotation was performed with Bakta (v. 1.9.3) using the full database (v. 5.1) and by default parameters. E. coli phylogroups were determined with EzClermont (v. 0.7.0)1 based on the approach from CermonTyping3. Genomes that did not classify into the main E. coli phylogroups or that were cryptic were discarded for following analyses.

Software used and versions

  • R version 4.3.2
  • NCBI dataset download tool v. 16.2.0
  • Bakta v. 1.9.3
  • Panaroo v. 1.5.0
  • mash v. 1.1
  • EzClermont v. 0.7.0

References

1.
Waters, N. R., Abram, F., Brennan, F., Holmes, A. & Pritchard, L. Easy phylotyping of escherichia coli via the EzClermont web app and command-line tool. Access Microbiology 2, (2020).
2.
3.
Beghain, J., Bridier-Nahmias, A., Le Nagard, H., Denamur, E. & Clermont, O. ClermonTyping: An easy-to-use and accurate in silico method for escherichia genus strain phylotyping. Microbial Genomics 4, (2018).